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Abstract 



The Partially Observable Markov Decision Process has long been recognized as a rich frame- 
work for real-world planning and contiol problems, especially in robotics. However exact solu- 
tions in this framework are typically computationally intractable for all but the smallest problems. 
A well-known technique for speeding up POMDP solving involves performing value backups at 
specific behef points, rather than over the entire belief simplex. The efficiency of this approach, 
however, depends greatly on the selection of points. This paper presents a set of novel techniques 
for selecting informative belief points which work well in practice. The point selection procedure 
is combined with point-based value backups to form an effective anytime POMDP algorithm called 
Point-Based Value Iteration (PBVl). The first aim of this paper is to introduce this algorithm and 
present a theoretical analysis justifying the choice of belief selection technique. The second aim of 
this paper is to provide a thorough empirical comparison between PBVI and other state-of-the-art 
POMDP methods, in particular the Perseus algorithm, in an effort to highlight their similarities and 
differences. Evaluation is performed using both standard POMDP domains and realistic robotic 
tasks. 

1. Introduction 

The concept of planning has a long tradition in the AI literature (Fikes & Nilsson, 1971; Chapman, 
1987; McAllester & Roseblitt, 1991; Penberthy & Weld, 1992; Blum & Furst, 1997). Classical 
planning is generally concerned with agents which operate in environments that are fully observable, 
deterministic, finite, static, and discrete. While these techniques are able to solve increasingly 
large state-space problems, the basic assumptions of classical planning — full observability, static 
environment, deterministic actions — make these unsuitable for most robotic applications. 

Planning under uncertainty aims to improve robustness by explicitly reasoning about the type of 
uncertainty that can arise. The Partially Observable Markov Decision Process (POMDP) (Astrom, 
1965; Sondik, 1971; Monahan, 1982; White, 1991; Lovejoy, 1991b; Kaelbling, Littman, & Cassan- 
dra, 1998; Boutilier, Dean, & Hanks, 1999) has emerged as possibly the most general representation 
for (single-agent) planning under uncertainty. The POMDP supersedes other frameworks in terms 
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of representational power simply because it combines the most essential features for planning under 
uncertainty. 

First, POMDPs handle uncertainty in both action effects and state observability, whereas many 
other frameworks handle neither of these, and some handle only stochastic action effects. To han- 
dle partial state observability, plans are expressed over information states, instead of world states, 
since the latter ones are not directly observable. The space of information states is the space of all 
beliefs a system might have regarding the world state. Information states are easily calculated from 
the measurements of noisy and imperfect sensors. In POMDPs, information states are typically 
represented by probability distributions over world states. 

Second, many POMDP algorithms form plans by optimizing a value function. This is a power- 
ful approach to plan optimization, since it allows one to numerically trade off between alternative 
ways to satisfy a goal, compare actions with different costs/rewards, as well as plan for multiple 
interacting goals. While value function optimization is used in other planning approaches — for ex- 
ample Markov Decision Processes (MDPs) (Bellman, 1957) — POMDPs are unique in expressing 
the value function over information states, rather than world states. 

Finally, whereas classical and conditional planners produce a sequence of actions, POMDPs 
produce a full policy for action selection, which prescribes the choice of action for any possible 
information state. By producing a universal plan, POMDPs alleviate the need for re-planning, and 
allow fast execution. Naturally, the main drawback of optimizing a universal plan is the computa- 
tional complexity of doing so. This is precisely what we seek to alleviate with the work described 
in this paper 

Most known algorithms for exact planning in POMDPs operate by optimizing the value function 
over all possible information states (also known as beliefs). These algorithms can run into the well- 
known curse of dimensionality, where the dimensionality of planning problem is directly related to 
the number of states (Kaelbling et al., 1998). But they can also suffer from the lesser known curse 
of history, where the number of belief-contingent plans increases exponentially with the planning 
horizon. In fact, exact POMDP planning is known to be PSPACE-complete, whereas propositional 
planning is only NP-complete (Littman, 1996). As a result, many POMDP domains with only a few 
states, actions and sensor observations are computationally intractable. 

A commonly used technique for speeding up POMDP solving involves selecting a finite set 
of belief points and performing value backups on this set (Sondik, 1971; Cheng, 1988; Lovejoy, 
1991a; Hauskrecht, 2000; Zhang & Zhang, 2001). While the usefulness of belief point updates 
is well acknowledged, how and when these backups should be applied has not been thoroughly 
explored. 

This paper describes a class of Point-Based Value Iteration (PBVI) POMDP approximations 
where the value function is estimated based strictly on point-based updates. In this context, the 
choice of points is an integral part of the algorithm, and our approach interleaves value backups 
with steps of belief point selection. One of the key contributions of this paper is the presentation 
and analysis of a set of heuristics for selecting informative belief points. These range from a naive 
version that combines point-based value updates with random belief point selection, to a sophisti- 
cated algorithm that combines the standard point-based value update with an estimate of the error 
bound between the approximate and exact solutions to select belief points. Empirical and theoret- 
ical evaluation of these techniques reveals the importance of taking distance between points into 
consideration when selecting belief points. The result is an approach which exhibits good perfor- 
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mance with very few belief points (sometimes less than the number of states), thereby overcoming 
the curse of history. 

The PBVI class of algorithms has a number of important properties, which are discussed at 
greater length in the paper: 

• Theoretical guarantees. We present a bound on the error of the value function obtained by 
point-based approximation, with respect to the exact solution. This bound appUes to a number 
of point-based approaches, including our own PBVI, Perseus (Spaan & Vlassis, 2005), and 
others. 

• Scalability. We are able to handle problems on the order of 10^ states, which is an or- 
der of magnitude larger than problems solved by more traditional POMDP techniques. The 
empirical performance is evaluated extensively in realistic robot tasks, including a search-for- 
missing-person scenario. 

• Wide applicability. The approach makes few assumptions about the nature or structure of the 
domain. The PBVI framework does assume known discrete state/ action/observation spaces 
and a known model (i.e., state-to- state transitions, observation probabilities, costs/rewards), 
but no additional specific structure (e.g., constrained policy class, factored model). 

• Anytime performance. An anytime solution can be achieved by gradually alternating phases 
of belief point selection and phases of point-based value updates. This allows for an effective 
trade-off between planning time and solution quality. 

While PBVI has many important properties, there are a number of other recent POMDP ap- 
proaches which exhibit competitive performance (Braziunas & Boutilier, 2004; Poupart & Boutilier, 
2004; Smith & Simmons, 2004; Spaan & Vlassis, 2005). We provide an overview of these tech- 
niques in the later part of the paper. We also provide a comparative evaluation of these algorithms 
and PBVI using standard POMDP domains, in an effort to guide practitioners in their choice of 
algorithm. One of the algorithms, Perseus (Spaan & Vlassis, 2005), is most closely related to PBVI 
both in design and in performance. We therefore provide a direct comparison of the two approaches 
using a realistic robot task, in an effort to shed further light on the comparative strengths and weak- 
nesses of these two approaches. 

The paper is organized as follows. Section 2 begins by exploring the basic concepts in POMDP 
solving, including representation, inference, and exact planning. Section 3 presents the general 
anytime PBVI algorithm and its theoretical properties. Section 4 discusses novel strategies to se- 
lect good belief points. Section 6 presents an empirical comparison of POMDP algorithms using 
standard simulation problems. Section 7 pursues the empirical evaluation by tackling complex robot 
domains and directly comparing PBVI with Perseus. Finally, Section 5 surveys a number of existing 
POMDP approaches that are closely related to PBVI. 

2. Review of POMDPs 

Partially Observable Markov Decision Processes provide a general planning and decision-making 
framework for acting optimally in partially observable domains. They are well-suited to a great 
number of real-world problems where decision-making is required despite prevalent uncertainty. 
They generally assume a complete and correct world model, with stochastic state transitions, im- 
perfect state tracking, and a reward structure. Given this information, the goal is to find an action 
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strategy which maximizes expected reward gains. This section first establishes the basic terminol- 
ogy and essential concepts pertaining to POMDPs, and then reviews optimal techniques for POMDP 
planning. 

2.1 Basic POMDP Terminology 

Formally, a POMDP is defined by six distinct quantities, denoted {S, A, Z, T, O, R}. The first three 
of these are: 

• States. The state of the world is denoted s, with the finite set of all states denoted hy S = 
{sq, Si, . . .}. The state at time t is denoted st, where t is a discrete time index. The state is 
not directly observable in POMDPs, where an agent can only compute a belief over the state 
space S. 

• Observations. To infer a beUef regarding the world's state s, the agent can take sensor mea- 
surements. The set of all measurements, or observations, is denoted Z = {zq, zi, . . .}. The 
observation at time t is denoted zt. Observation zt is usually an incomplete projection of the 
world state st, contaminated by sensor noise. 

• Actions. To act in the world, the agent is given a finite set of actions, denoted A = 
{ao, ai, . . .}. Actions stochastically affect the state of the world. Choosing the right action as 
a function of history is the core problem in POMDPs. 

Throughout this paper, we assume that states, actions and observations are discrete and finite. 
For mathematical convenience, we also assume that actions and observations are alternated over 
time. 

To fully define a POMDP, we have to specify the probabilistic laws that describe state transitions 
and observations. These laws are given by the following disttibutions: 

• The state transition probability distribution, 

T{s,a,s') := Pr{st = s' \ st-i = s,at-i = a) \/t, (1) 

is the probability of transitioning to state s', given that the agent is in state s and se- 
lects action a, for any (s, a, s'). Since T is a conditional probability distribution, we have 
J2s'eS ^(•5) s') = 1, V(s, a). As our notation suggests, T is time-invariant. 

• The observation probability distribution, 

0{s, a, z) := Pr{zt = z \ = s, at-\ = a) Vt, (2) 

is the probability that the agent will perceive observation z upon executing action a in state s. 
This conditional probability is defined for all (s, a, z) triplets, for which Y^^ez 0{s, a, z) = 
1, V(s, a). The probability function O is also time-invariant. 

Finally, the objective of POMDP planning is to optimize action selection, so the agent is given 
a reward function describing its performance: 
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• The reward function. R{s,a) : S x A — > assigns a numerical value quantifying the 
utility of performing action a when in state s. We assume the reward is bounded, Rmin < 
R < Rmax- The goal of the agent is to collect as much reward as possible over time. More 
precisely, it wants to maximize the sum: 

T 

E[J2^'-'"rt], (3) 

t=to 

where rt is the reward at time t, E[]is the mathematical expectation, and 7 where < 7 < 1 
is a discount factor, which ensures that the sum in Equation 3 is finite. 

These items together, the states S, actions A, observations Z, reward R, and the probabihty 
distributions, T and O, define the probabihstic world model that underhes each POMDP. 

2.2 Belief Computation 

POMDPs are instances of Markov processes, which implies that the current world state, st, is suf- 
ficient to predict the future, independent of the past {sq, si, sj-i}. The key characteristic that 
sets POMDPs apart from many other probabilistic models (such as MDPs) is the fact that the state 
St is not directly observable. Instead, the agent can only perceive observations {zi,. . . ,zt}, which 
convey incomplete information about the world's state. 

Given that the state is not directly observable, the agent can instead maintain a complete trace 
of all observations and all actions it ever executed, and use this to select its actions. The ac- 
tion/observation trace is known as a history. We formally define 

ht := {ao,zi,...,zt-i,at-i,zt} (4) 

to be the history at time t. 

This history trace can get very long as time goes on. A well-known fact is that this history 
does not need to be represented explicitly, but can instead be summarized via a belief distribu- 
tion (Astrom, 1965), which is the following posterior probability distribution: 

bt{s) := Pr{st = s \ zt,at-i,zt-i,...,ao,bo). (5) 

This of course requires knowing the initial state probability distribution: 

6o(s) := Pr(so = s), (6) 

which defines the probabihty that the domain is in state s at time t = 0. It is common either to 

specify this initial belief as part of the model, or to give it only to the runtime system which tracks 
beliefs and selects actions. For our work, we will assume that this initial belief (or a set of possible 
initial beliefs) are available to the planner. 

Because the belief distribution bt is a sufficient statistic for the history, it suffices to condition 
the selection of actions on bt, instead of on the ever-growing sequence of past observations and 
actions. Furthermore, the belief bt at time t is calculated recursively, using only the belief one time 
step earlier, bt-i, along with the most recent action at-i and observation zt. 
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We define the belief update equation, r(), as: 

T{bt-i,at-i,zt) = bt{s') 

^ 0(s', at-i,zt) T{s, at-i,s') h-iis) 

= — (7) 

Pr{zt\bt-i,at-i) 

where the denominator is a normaUzing constant. 

This equation is equivalent to the decades-old Bayes filter (Jazwinski, 1970), and is commonly 
applied in the context of hidden Markov models (Rabiner, 1989), where it is known as the forward 
algorithm. Its continuous generaUzation forms the basis of Kalman filters (Kalman, 1960). 

It is interesting to consider the nature of belief distributions. Even for finite state spaces, the 
belief is a continuous quantity. It is defined over a simplex describing the space of all distributions 
over the state space S. For very large state spaces, calculating the belief update (Eqn 7) can be com- 
putationally challenging. Recent research has led to efficient techniques for beUef state computation 
that exploit structure of the domain (Dean & Kanazawa, 1988; Boyen & KoUer, 1998; Poupart & 
Boutilier, 2000; Thrun, Fox, Burgard, & Dellaert, 2000). However, by far the most complex as- 
pect of POMDP planning is the generation of a policy for action selection, which is described next. 
For example in robotics, calculating beliefs over state spaces with 10^ states is easily done in real- 
time (Burgard et al., 1999). In contrast, calculating optimal action selection policies exactly appears 
to be infeasible for environments with more than a few dozen states (Kaelbling et al., 1998), not 
directly because of the size of the state space, but because of the complexity of the optimal policies. 
Hence we assume throughout this paper that the belief can be computed accurately, and instead 
focus on the problem of finding good approximations to the optimal policy. 



2.3 Optimal Policy Computation 

The central objective of the POMDP perspective is to compute a policy for selecting actions. A 
policy is of the form: 

7r(6) a, (8) 

where 6 is a belief distribution and a is the action chosen by the policy vr. 

Of particular interest is the notion of optimal policy, which is a policy that maximizes the ex- 
pected future discounted cumulative reward: 









7r*(5tJ = argmaxE'^ 








t=tQ 





There are two distinct but interdependent reasons why computing an optimal poUcy is challeng- 
ing. The more widely-known reason is the so-called curse of dimensionality: in a problem with 
n physical states, vr is defined over all belief states in an (n — 1) -dimensional continuous space. 
The less- well-known reason is the curse of history: POMDP solving is in many ways like a search 
through the space of possible POMDP histories. It starts by searching over short histories (through 
which it can select the best short policies), and gradually considers increasingly long histories. Un- 
fortunately the number of distinct possible action-observation histories grows exponentially with 
the planning horizon. 
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The two curses — dimensionality and history — often act independently: planning complexity 
can grow exponentially with horizon even in problems with only a few states, and problems with a 
large number of physical states may still only have a small number of relevant histories. Which curse 
is predominant depends both on the problem at hand, and the solution technique. For example, the 
belief point methods that are the focus of this paper specifically target the curse of history, leaving 
themselves vulnerable to the curse of dimensionality. Exact algorithms on the other hand typically 
suffer far more from the curse of history. The goal is therefore to find techniques that offer the best 
balance between both. 

We now describe a straightforward approach to finding optimal policies by Sondik (1971). The 

overall idea is to apply multiple iterations of dynamic programming, to compute increasingly more 
accurate values for each belief state b. Let F be a value function that maps belief states to values in 
Beginning with the initial value function: 

Vo{b) = max^i?(s,a)6(s), (10) 

" ses 

then the t-th value function is constructed from the {t — l)-th by the following recursive equation: 



Vt{b) = max 



J2 a)bis) + 7 ^ Priz \ a, 6)14_i(r(6, a, z)) 
.ses zez 



(11) 



where r(6, a, z) is the behef updating function defined in Equation 7. This value function update 

maximizes the expected sum of all (possibly discounted) future pay-offs the agent receives in the 
next t time steps, for any belief state b. Thus, it produces a policy that is optimal under the planning 
horizon t. The optimal policy can also be directly extracted from the previous-step value function: 



7rj(6) = argmax 



^ R{s, a)b{s) + 7 ^ Pr{z \ a, 6)T4_i(r(6, a, z)) 
LseS zez 



(12) 



Sondik (1971) showed that the value function at any finite horizon t can be expressed by a set 
of vectors: Tt = {ao, ai, . . . , a„i}- Each a-vector represents an l^j -dimensional hyper-plane, and 
defines the value function over a bounded region of the belief: 

Vt{b) = max Va(s)6(s). (13) 

In addition, each a-vector is associated with an action, defining the best immediate policy 
assuming optimal behavior for the following [t — 1) steps (as defined respectively by the sets 
{Vt-,,...,Vo}). 

The t-horizon solution set, Tt, can be computed as follows. First, we rewrite Equation 1 1 as: 



Vt{b) = max 



seS zez°'^ ^-^seSs'eS 



■ (14) 



Notice that in this representation of Vt{b), the nonlinearity in the term P{z\a, b) from Equation 11 
cancels out the nonlinearity in the term r(6, a, z), leaving a linear function of b(s) inside the max 
operator. 
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The value Vt (6) cannot be computed directly for each belief b & B (since there are infinitely 
many beliefs), but the corresponding set Ft can be generated through a sequence of operations on 
the set Tt-i. 

The first operation is to generate intermediate sets F"'* and F"''^, Va & A,'iz & Z (Step 1): 

F"'* ^ a'^'^s) = R{s,a) (15) 

F^'^ a"'^(s) = 7 ^ r(s,a, s')0(s',a,z)ai(s'),VQ:i G Ft_i 

s'eS 

where each a"-'* and a^'^ is once again an 1 6" | -dimensional hyper-plane. 

Next we create F^ (Va G A), the cross-sum over observations^, which includes one a"'^ from 
eachF"'^C5?e;7 2j: 

pa ^ pa,* _^ pa,2i ^ ^a,Z2 ^ _ _ 

Finally we take the union of F" sets (Step 3): 

Ft = Ua6AF^ (17) 



This forms the pieces of the backup solution at horizon t. The actual value function Vt is 
extracted from the set Ft as described in Equation 13. 

Using this approach, bounded-time POMDP problems with finite state, action, and observation 
spaces can be solved exactly given a choice of the horizon T. If the environment is such that the 
agent might not be able to bound the planning horizon in advance, the policy vr^ (b) is an approxima- 
tion to the optimal one whose quality improves in expectation with the planning horizon t (assuming 
< 7 < 1). 

As mentioned above, the value function Vt can be extracted directly from the set F^. An im- 
portant aspect of this algorithm (and of all optimal finite-horizon POMDP solutions) is that the 
value function is guaranteed to be a piecewise linear, convex, and continuous function of the be- 
Uef (Sondik, 1971). The piecewise-linearity and continuous properties are a direct result of the fact 
that Vt is composed of finitely many Unear cc-vectors. The convexity property is a result of the 
maximization operator (Eqn 13). It is worth pointing out that the intermediate sets F^'^, F"'* and Ff 
also represent functions of the belief which are composed entirely of linear segments. This property 
holds for the intermediate representations because they incorporate the expectation over observation 
probabilities (Eqn 15). 

In the worst case, the exact value update procedure described could require time doubly ex- 
ponential in the planning horizon T (Kaelbling et al., 1998). To better understand the complexity 
of the exact update, let |5| be the number of states, |^| the number of actions, \Z\ the number of 
observations, and |Fj_i| the number of a-vectors in the previous solution set. Then Step 1 creates 
1^1 \ Z\ |F(_i| projections and Step 2 generates \ A\ iFt-ill'^l cross-sums. So, in the worst case, the 
new solution requires: 

\Ft\ = Oi\A\\Ft-i\\^\) (18) 

1. The symbol ® denotes the cross-sum operator. A cross-sum operation is defined over two sets, A = 
{tti, 02, ... , am} and B — {61, 62, • • • , &«}, and produces a third set, C = {ai + 61, oi -|- 62, ■ ■ ■ , ai -|- 6„, 02 -|- 

bi,a2 + b2, ,am + 6„}. 
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a-vectors to represent the value function at horizon t; these can be computed in time 

o(|5|2|^||ri_i|i^i). 

It is often the case that a vector in Tt will be completely dominated by another vector over the 
entire belief simplex: 

ai-b<aj- b, V6. (19) 

Similarly, a vector may be fully dominated by a set of other vectors (e.g., 02 in Fig- 1 is dom- 
inated by the combination of ai and as). This vector can then be pruned away without affecting 
the solution. Finding dominated vectors can be expensive. Checking whether a single vector is 
dominated requires solving a linear program with |5| variables and |rt| constraints. Nonetheless it 
can be time-effective to apply pruning after each iteration to prevent an explosion of the solution 
size. In practice, |rt| often appears to grow singly exponentially in t, given clever mechanisms for 
pruning unnecessary linear functions. This enormous computational complexity has long been a 
key impediment toward applying POMDPs to practical problems. 




Figure 1 : POMDP value function representation 
2.4 Point-Based Value Backup 

Exact POMDP solving, as outlined above, optimizes the value function over all beliefs. Many 
approximate POMDP solutions, including the PBVI approach proposed in this paper, gain compu- 
tational advantage by applying value updates at specific (and few) behef points, rather than over all 
beliefs (Cheng, 1988; Zhang & Zhang, 2001; Poon, 2001). These approaches differ significantly 
(and to great consequence) in how they select the behef points, but once a set of points is selected, 
the procedure for updating their value is standard. We now describe the procedure for updating the 
value function at a set of known behef points. 

As in Section 2.3, the value function update is implemented as a sequence of operations on a 
set of a-vectors. If we assume that we are only interested in updating the value function at a fixed 
set of belief points, B = {bo, bi, bq}, then it follows that the value function will contain at most 
one a-vector for each behef point. The point-based value function is therefore represented by the 
corresponding set {cto, ai, . . . , Og}. 

Given a solution set F^-i, we simply modify the exact backup operator (Eqn 14) such that only 
one Q-vector per belief point is maintained. The point-based backup now gives an a-vector which 
is valid over a region around b. It assumes that the other belief points in that region have the same 
action choice and lead to the same facets of Vt-i as the point b. This is the key idea behind all 
algorithms presented in this paper, and the reason for the large computational savings associated 
with this class of algorithms. 
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To obtain solution set Tt from the previous set Tt-i, we begin once again by generating inter- 
mediate sets r"'* and T"'^, Va e A,Wz E Z (exactly as in Eqn 15) (Step 1 ): 

r^'* ^ a«'*(s) = i?(s,a) (20) 

r«'^ ^ a'^^\s)=jY.T{s,a,s')0{s',a,z)ai{s')yaiert-i. 
s'es 

Next, whereas performing an exact value update requires a cross-sum operation (Eqn 16), by 
operating over a finite set of points, we can instead use a simple summation. We construct F^, Va G 
A (Step 2): 

^ a^ = r"'* + ^argmax(^a(s)6(s)),V6GB. (21) 

Finally, we find the best action for each belief point (Step 3): 

ab = argmax(Vr^(s)6(s)), V6 G (22) 

Ft = Ubesab (23) 

While these operations preserve only the best a-vector at each belief point 6 G S, an estimate 
of the value function at any belief in the simplex (including b ^ B) can be extracted from the set 
just as before: 

Vt{b) = max Va(s)6(s). (24) 

ses 

To better understand the complexity of updating the value of a set of points B, let l^l be the 
number of states, |^| the number of actions, \Z\ the number of observations, and |rt_i| the number 
of a-vectors in the previous solution set. As with an exact update. Step 1 creates \A\ \Z\ \Tt-i\ 
projections (in time |5p |^| l^'l |rt_i|). Steps 2 and 3 then reduce this set to at most \B\ components 
(in time l^l |^| \Ft-i \ \Z\ \B\). Thus, a full point-based value update takes only polynomial time, 
and even more crucially, the size of the solution set Ft remains constant at every iteration. The 
point-based value backup algorithm is summarized in Table 1. 

Note that the algorithm as outlined in Table 1 includes a trivial pruning step (lines 13-14), 
whereby we refrain from adding to F^ any vector already included in it. As a result, it is often the 
case that |Ft| < \B\. This situation arises whenever multiple nearby belief points support the same 
vector. This pruning step can be computed rapidly (without solving hnear programs) and is clearly 
advantageous in terms of reducing the set F^. 

The point-based value backup is found in many POMDP solvers, and in general serves to im- 
prove estimates of the value function. It is also an integral part of the PBVI framework. 

3. Anytime Point-Based Value Iteration 

We now describe the algorithmic framework for our new class of fast approximate POMDP algo- 
rithms called Point-Based Value Iteration (PBVI). PBVl-class algorithms offer an anytime solution 
to large-scale discrete POMDP domains. The key to achieving an anytime solution is to interleave 
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Tt=BACK\JP{B, Tt_i) 1 

For each action a & A 2 

For each observation z e Z 3 

For each solution vector Oi € Tt-i 4 

=-fJ2s'esT{s,a,s')0{s',a,z)ai{s'),ys e S 5 

End 6 

r"'^ = Ui a"'^ 7 

End 8 

End 9 

r* = 10 

For each belief point b e B 11 



ab = argmax„g^ 



If(afe ^ Tj) 13 

Ft = Ft U aft 14 

End 15 

Return Tt 16 

Table 1 : Point-based value backup 



two main components: the point-based update described in Table 1 and steps of belief set selec- 
tion. The approximate value function we find is guaranteed to have bounded error (compared to the 
optimal) for any discrete POMDP domain. 

The current section focuses on the overall anytime algorithm and its theoretical properties, in- 
dependent of the belief point selection process. Section 4 then discusses in detail various novel 
techniques for belief point selection. 

The overall PBVl framework is simple. We start with a (small) initial set of belief points to 
which are applied a first series of backup operations. The set of beUef points is then grown, a 
new series of backup operations are applied to all belief points (old and new), and so on, until a 
satisfactory solution is obtained. By interleaving value backup iterations with expansions of the 
belief set, PBVl offers a range of solutions, gradually trading off computation time and solution 
quality. 

The full algorithm is presented in Table 2. The algorithm accepts as input an initial belief point 
set (Bjnit), an initial value (Fq), the number of desired expansions (N), and the planning horizon 
(T). A common choice for Bj^u is the initial belief bo; alternately, a larger set could be used, 
especially in cases where sample trajectories are available. The initial value, Fq, is typically set to 
be purposefully low (e.g., ao(s) = , Vs G 5). When we do this, we can show that the point- 
based solution is always be a lower-bound on the exact solution (Lovejoy, 1991a). This follows 
from the simple observation that failing to compute an a- vector can only lower the value function. 

For problems with a finite horizon, we run T value backups between each expansion of the 
belief set. In infinite-horizon problems, we select the horizon T so that 

j'^ [itlmax - i^min] < €, (25) 

where i?max = max^^a R{s, a) and Rmin = ^^^s,a R(s, a). 
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The complete algorithm terminates once a fixed number of expansions (A^) have been com- 
pleted. Alternately, the algorithm could terminate once the value function approximation reaches a 
given performance criterion. This is discussed further below. 

The algorithm uses the BACKUP routine described in Table 1 . We can assume for the moment 
that the EXPAND subroutine (line 8) selects behef points at random. This performs reasonably 
well for small problems where it is easy to achieve good coverage of the entire belief simplex. 
However it scales poorly to larger domains where exponentially many points are needed to guarantee 
good coverage of the behef simplex. More sophisticated approaches to selecting belief points are 
presented in Section 4. Overall, the PBVI framework described here offers a simple yet flexible 
approach to solving large-scale POMDPs. 



r=PBVI-MAIN(S/„it, To, N, T) 1 



B=Binit 2 

r = ro 3 

For N expansions 4 

For T iterations 5 

r =BACKUP(S,r) 6 

End 7 

Bnew =EXPAND(S,r) 8 

B = B U Bnew 9 

End 10 

Return T 1 1 



Table 2: Algorithm for Point-Based Value Iteration (PBVI) 



For any belief set B and horizon t, the algorithm in Table 2 will produce an estimate of the value 
function, denoted V^^. We now show that the error between V^^ and the optimal value function V* 
is bounded. The bound depends on how densely B samples the belief simplex A; with denser 
sampling, V^^ converges to V*, the t-horizon optimal solution, which in turn has bounded error 
with respect to V*, the optimal solution. So cutting off the PBVI iterations at any sufficiently large 
horizon, we can show that the difference between V^^ and the optimal infinite-horizon V* is not too 
large. The overall error in PBVI is bounded, according to the triangle inequality, by: 

ll^^t^ - F*||oo < \\Vf - V*\\oo + \\Vt* - F*||oo. (26) 

The second term is bounded by 7* || V^o* ~ ^* II (Bertsekas & Tsitsikhs, 1996). The remainder of this 
section states and proves a bound on the first term, which we denote e^. 

Begin by assuming that H denotes an exact value backup, and H denotes the PBVI backup. 
Now define e{b) to be the error introduced at a specific belief 6 G A by performing one iteration of 
point-based backup: 

e{b) = \HV^{b)-HV''{b)\^. 
Next define e to be the maximum total error introduced by doing one iteration of point-based backup: 

e = \HV^-HV^\oo 

= maxe(6). 
feeA ^ ' 
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Finally define the density 6b of a set of belief points B to be the maximum distance from any belief 
in the simplex A to a belief in set B. More precisely: 

(5r = maxmin 116 — (27) 

fe'eA b&B 

Now we can prove the following lemma: 

Lemma 1. The error introduced in PBVI when performing one iteration of value backup over B, 
instead of over A, is bounded by 

1-7 

Proof: Let 6' G A be the point where PBVI makes its worst error in value update, and b ^ B 
be the closest (1-norm) sampled belief to b'. Let a be the vector that is maximal at b, and a' be the 
vector that would be maximal at b'. By failing to include a' in its solution set, PBVI makes an error 
of at most a' • 6' — a • 6'. On the other hand, since a is maximal at b, then a' ■ b < a ■ b. So, 

e < a' -b' -a-b' 

= a' ■ b' - a ■ b' + (a' ■ b - a' ■ b) Add zero 

< a' ■ b' — a ■ b' + a ■ b — a' ■ b Assume a is optimal at b 
= {a' — a) ■ {b' — b) Re-arrange the terms 

< lla' — a||oo||6' — b\\i By Holder inequality 

< ||a' — alloo'^s By definition of (5b 

^ 1-7 

The last inequality holds because each a- vector represents the reward achievable starting from 
some state and following some sequence of actions and observations. Therefore the sum of rewards 
must fall between and □ 

Lemma 1 states a bound on the approximation error introduced by one iteration of point-based 
value updates within the PBVI framework. We now look at the bound over multiple value updates. 

Theorem 3.1. For any belief set B and any horizon t, the error of the PBVI algorithm = 1 1 V^^ — 
V^* II oo is bounded by 

(-Rmax — ^min)'^^ 



et< 



(1-7)^ 



Proof: 



= \\Vf-Vt*\\oo 

= I |#Fi?i - HV^*_i\ loo By definition of H 

< I I^Ft^i - -fft^t^i I loo + I I^V^t-i - HVt*_^ I loo By triangle inequality 

< (^■"ax-flmin)^B ^ I li/yB^ _ HV^*_-^\\oo By lemma 1 

+ 7 1 1 1 — Vf*_ 1 1 1 00 By contraction of exact value backup 

+ jet-i By definition of et-i 

< vxvmax-^min7"B By sum of 3 gcometdc scdes □ 



1 


-7 




(-^max 


Rmi 




1 


-7 




(-Rmax 


-Rmi 


n)<5_B 
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-7 
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-Rmi 
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The bound described in this section depends on how densely B samples the belief simplex A. 
In the case where not all beliefs are reachable, PBVI does not need to sample all of A densely, but 
can replace A by the set of reachable beliefs A (Fig. 2). The error bounds and convergence results 
hold on A. We simply need to re-define 6' G A in lemma 1 . 

As a side note, it is worth pointing out that because PBVI makes no assumption regarding 
the initial value function V^^, the point-based solution is not guaranteed to improve with the 
addition of belief points. Nonetheless, the theorem presented in this section shows that the bound 
on the error between V^^ (the point-based solution) and V* (the optimal solution) is guaranteed 
to decrease (or stay the same) with the addition of belief points. In cases where V^^ is initialized 
pessimistically (e.g., V(f{s) = ^j^,ys G S, as suggested above), then Vf^ will improve (or stay 
the same) with each value backup and addition of belief points. 

This section has thus far skirted the issue of belief point selection, however the bound presented 
in this section clearly argues in favor of dense sampUng over the belief simplex. While randomly 
selecting points according to a uniform distribution may eventually accomplish this, it is generally 
inefficient, in particular for high dimensional cases. Furthermore, it does not take advantage of 
the fact that the error bound holds for dense sampling over reachable beliefs. Thus we seek more 
efficient ways to generate belief points than at random over the entire simplex. This is the issue 
explored in the next section. 

4. Belief Point Selection 

In section 3, we outlined the prototypical PBVI algorithm, while conveniently avoiding the question 
of how and when belief points should be selected. There is a clear trade-off between including fewer 
beliefs (which would favor fast planning over good performance), versus including many beliefs 
(which would slow down planning, but ensure a better bound on performance). This brings up the 
question of how many belief points should be included. However the number of points is not the only 
consideration. It is likely that some collections of belief points (e.g., those frequently encountered) 
are more likely to produce a good value function than others. This brings up the question of which 
beliefs should be included. 

A number of approaches have been proposed in the literature. For example, some exact value 
function approaches use linear programs to identify points where the value function needs to be 
further improved (Cheng, 1988; Littman, 1996; Zhang & Zhang, 2001), however this is typically 
very expensive. The value function can also be approximated by learning the value at regular points, 
using a fixed-resolution (Lovejoy, 1991a), or variable-resolution (Zhou & Hansen, 2001) grid. This 
is less expensive than solving LPs, but can scales poorly as the number of states increases. Alter- 
nately, one can use heuristics to generate grid-points (Hauskrecht, 2000; Poon, 2001). This tends 
to be more scalable, though significant experimentation is required to establish which heuristics are 
most useful. 

This section presents five heuristic strategies for selecting belief points, from fast and naive 
random sampling, to increasingly more sophisticated stochastic simulation techniques. The most 
effective strategy we propose is one that carefully selects points that are likely to have the largest 
impact in reducing the error bound (Theorem 3.1). 

Most of the strategies we consider focus on selecting reachable beliefs, rather than getting 
uniform coverage over the entire belief simplex. Therefore it is useful to begin this discussion by 
looking at how reachability is assessed. 
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While some exact POMDP value iteration solutions are optimal for any initial belief, PBVI (and 
other related techniques) assume a known initial belief bo. As shown in Figure 2, we can use the 
initial belief to build a tree of reachable beliefs. In this representation, each path through the tree 
corresponds to a sequence in belief space, and increasing depth corresponds to an increasing plan 
horizon. When selecting a set of belief points for PBVI, including all reachable beliefs would guar- 
antee optimal performance (conditioned on the initial belief), but at the expense of computational 
tractability, since the set of reachable beliefs. A, can grow exponentially with the planning hori- 
zon. Therefore, it is best to select a subset B c A which is sufficiently small for computational 
tractability, but sufficiently large for good value function approximation.^ 



u 






Figure 2: The set of reachable beliefs 

In domains where the initial belief is not known (or not unique), it is still possible to use reach- 
ability analysis by sampling a few initial beliefs (or using a set of known initial beliefs) to seed 
multiple reachability trees. 

We now discuss five strategies for selecting belief points, each of which can be used within the 
PBVI framework to perform expansion of the belief set. 

4.1 Random Belief Selection (RA) 

The first strategy is also the simplest. It consists of sampling belief points from a uniform distri- 
bution over the entire belief simplex. To sample over the simplex, we cannot simply sample each 
b{s) independently over [0, 1] (this would violate the constraint that J2s K^) = Instead, we use 
the algorithm described in Table 3 (see Devroye, 1986, for more details including proof of uniform 
coverage). 

This random point selection strategy, unlike the other strategies presented below, does not focus 
on reachable beliefs. For this reason, we do not necessarily advocate this approach. However we 
include it because it is an obvious choice, it is by far the simplest to implement, and it has been used 
in related work by Hauskrecht (2000) and Poon (2001). In smaller domains (e.g., <20 states), it 

2. All strategies discussed below assume that the belief point set, B, approximately doubles in size on each belief 
expansion. This ensures that the number of rounds of value iteration is logarithmic (in the final number of belief 
points needed). Alternately, each strategy could be used (with very little modification) to add a fixed number of new 
belief points, but this may require many more rounds of value iteration. Since value iteration is much more expensive 
than belief computation, it seems appropriate to double the size of B at each expansion. 
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1 

2 
3 
4 
5 
6 
7 
8 
9 
10 
11 
12 
13 
14 



Foreach b E B 

S := number of states 
For i = : S* 



End 



Sort btmp in ascending order 



For i = 1 : S* - 1 




End 

Return B, 



Table 3: Algorithm for belief expansion with random action selection 



performs reasonably well, since the belief simplex is relatively low-dimensional. In large domains 
(e.g., 100+ states), it cannot provide good coverage of the belief simplex with a reasonable number 
of points, and therefore exhibits poor performance. This is demonstrated in the experimental results 
presented in Section 6. 

All of the remaining belief selection strategies make use of the belief tree (Figure 2) to focus on 
reachable beliefs, rather than trying to cover the entire belief simplex. 

4.2 Stochastic Simulation witli Random Action (SSRA) 

To generate points along the belief tree, we use a technique called stochastic simulation. It involves 
running single-step forward trajectories from belief points already in B. Simulating a single-step 
forward trajectory for a given b e B requires selecting an action and observation pair (a, z), and 
then computing the new belief t(6, a, z) using the Bayesian update rule (Eqn 7). In the case of 
Stochastic Simulation with Random Action (SSRA), the action selected for forward simulation is 
picked (uniformly) at random from the full action set. Table 4 summarizes the belief expansion 
procedure for SSRA. First, a state s is drawn from the belief distribution b. Second, an action a 
is drawn at random from the full action set. Next, a posterior state s' is drawn from the transition 
model T{s,a,s'). Finally, an observation z is drawn from the observation model 0{s',a,z). Using 
the triple (6, a, z), we can calculate the new belief bnew = ^ib, a, z) (according to Equation 7), and 
add to the set of belief points Bnew 

This strategy is better than picking points at random (as described above), because it restricts 
Bnew to the belief tree (Fig. 2). However this belief tree is still very large, especially when the 
branching factor is high, due to large numbers of actions/observations. By being more selective 
about which paths in the belief tree are explored, one can hope to effectively restrict the belief set 
further. 
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1 
2 
3 
4 
5 
6 
7 
8 
9 
10 
11 



«=''andinultinomial(^) 



a=randuniform(^) 



s'=randjnultinomial(^(^' O) 

z=randjnuitmomial(0(«'' «' 0) 

bnew = r{b, a, z) (see Eqn 7) 




Table 4: Algorithm for belief expansion with random action selection 



A similar technique for stochastic simulation was discussed by Poon (2001), however the be- 
lief set was initialized differently (not using bo), and therefore the stochastic simulations were not 
restricted to the set of reachable beliefs. 

4,3 Stochastic Simulation with Greedy Action (SSGA) 

The procedure for generating points using Stochastic Simulation with Greedy Action (SSGA) is 
based on the well-known e- greedy exploration strategy used in reinforcement learning (Sutton & 
Barto, 1998). This strategy is similar to the SSRA procedure, except that rather than choosing an 
action randomly, SSEA will choose the greedy action (i.e., the current best action at the given belief 
b) with probability 1 — e, and will chose a random action with probability e (we use e = 0.1). Once 
the action is selected, we perform a single-step forward simulation as in SSRA to yield a new belief 
point. Table 5 summarizes the belief expansion procedure for SSGA. 

A similar technique, featuring stochastic simulation using greedy actions, was outhned 

by Hauskrecht (2000). However in that case, the belief set included all extreme points of the belief 
simplex, and stochastic simulation was done from those extreme points, rather than from the initial 



4.4 Stochastic Simulation with Exploratory Action (SSEA) 

The error bound in Section 3 suggests that PB VI performs best when its belief set is uniformly dense 
in the set of reachable beliefs. The belief point strategies proposed thus far ignore this information. 
The next approach we propose gradually expands B by greedily choosing new reachable beliefs 
that improve the worst-case density. 

Unlike SSRA and SSGA which select a single action to simulate the forward trajectory for 
a given b e B, Stochastic Sampling with Exploratory Action (SSEA) does a one step forward 
simulation with each action, thus producing new beliefs {bag, 6an •••}■ However it does not accept 
all new beliefs {6a„, 6ai, •••}, but rather calculates the Li distance between each ba and its closest 
neighbor in B. We then keep only that point ba that is farthest away from any point already in B. 



belief. 
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Foreach b e B 3 

«=randinultinomial(^) ^ 

If randunifoj.m[0' 1] < ^ ^ 

a=randuijifonn(^) ^ 

Else 7 

a=argmax„£rEse5«(s)^(s) 8 

End 9 

s'=randjnuitinomial(^(«' a, •)) 10 

^^dmultinomialC^l^'' «' O) 1 1 

^new) = T{b, a, z) (scc Eqn 7) 12 

^new ~ ^new U bnew 13 

End 14 

Return S^eiu 15 



Table 5: Algorithm for belief expansion with greedy action selection 



We use the Li norm to calculate distance between belief points to be consistent with the error bound 
in Theorem 3.1. Table 6 summarizes the SSEA expansion procedure. 



5„e«,=EXPANDssEA(^' 1 

Foreach b e B 3 

Foreach a G A 4 

*=i"andinultinomial(^) 5 

s'=randn,uitjnomial(^(«' «' O) 6 

^™djnultinomial(0(s', a, •)) 7 

ba=T{b, a, z) (see Eqn 7) 8 

End 9 
6ne«) = maXaeA minb'eBnew T,seS \ba{s) - b'{s)\ 10 

Bnew = Bnew U bnew (see Eqn 7) 1 1 

End 12 

Return B^ew 13 



Table 6: Algorithm for belief expansion with exploratory action selection 



4,5 Greedy Error Reduction (GER) 

While the SSEA strategy above is able to improve the worst-case density of reachable beliefs, it 
does not directly minimize the expected error. And while we would like to directly minimize the 
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error, all we can measure is a bound on the error (Lemma 1). We therefore propose a final strategy 
which greedily adds the candidate beliefs that will most effectively reduce this error bound. Our 
empirical results, as presented below, show that this strategy is the most successful one discovered 
thus far 

To understand how we expand the belief set in the GER strategy, it is useful to re-consider the 
belief tree, which we reproduce in Figure 3. Each node in the tree corresponds to a specific belief. 
We can divide these nodes into three sets. Set 1 includes those belief points already in B, in this 
case bo and 6aozo- Set 2 contains those belief points that are immediate descendants of the points 
in B (i.e., the nodes in the grey zone). These are the candidates from which we will select the new 
points to be added to B. We call this set the envelope (denoted B). Set 3 contains all other reachable 
beliefs. 






b b b b b b 

^Alj ^^0 ^^1 ^^q 



b b b 



Figure 3: The set of reachable beliefs 

We need to decide which belief b should be removed from the envelope B and added to the set 
of active belief points B. Every point that is added to B will improve our estimate of the value 
function. The new point will reduce the error bounds (as defined in Section 3 for points that were 
already in B; however, the error bound for the new point itself might be quite large. That means that 
the largest error bound for points in B will not monotonically decrease; however, for a particular 
point in B (such as the initial belief bo) the error bound will be decreasing. 

To find the point which will most reduce our error bound, we can look at the analysis of 
Lemma 1 . Lemma 1 bounds the amount of additional error that a single point-based backup in- 
troduces. Write b' for the new belief which we are considering adding, and write b for some belief 
which is already in B. Write a for the value hyper-plane at b, and write a' for b'. As the lemma 
points out, we have 

e{b') < {a' - a) ■ {b' - b) 

When evaluating this error, we need to minimize over all b £ B. Also, since we do not know what 
a' will be until we have done some backups at b', we make a conservative assumption and choose 
the worst-case value of a' G [Rmm/ (1 — 7)) -Rmax/ (1 — 7)]''^'- Thus, we can evaluate: 



e(6') < min 




■a{s))ib'{s) 
amb'is) 



b{s)) b'{s) > bis) 
b{s)) b'{s) < b{s) 



(28) 
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While one could simply pick the candidate b' ^ B which currently has the largest error bound,^ 
e(b'), this would ignore reachability considerations. Rather, we evaluate the error at each b E B,hy 
weighing the error of the fringe nodes by their reachability probability: 

e(6) = max y 0(6, a, z) e(r(6,a,z)) (29) 

= max E E E ^')0{s', a, z)b{s) e{T{b, a, z)), 

''^^ zez \sess'es J 

noting that r(6, a, z) G B, and e(r(6, a, z)) can be evaluated according to Equation 28. 

Using Equation 29, we find the existing point b ^ B with the largest error bound. We can now 
directly reduce its error by adding to our set one of its descendants. We select the next-step belief 
T{b, a, z) which maximizes error bound reduction: 

B = B\JT{b,d,z), (30) 

where b, a := argmax 0{b, a, z) e{T{b, a, z)) (31) 
beB,aeA 

z := argmax 0(6, a, z) e{T{b, a, z)) (32) 
zez 

Table 7 summarizes the GER approach to belief point selection. 



S„e^=EXPANDGER(^, r) 1 

N=\B\ 3 

Fori = l: N 4 

b, a := argmaxjgs J^zez 0{b, a, z) e{T{b, a, z)) 5 

z := argmax^g^ 0{b, a, z) e{T{b, d, z)) 6 

bnew = rib, a, z) 7 

B'new — Bjigyj U bfigyj 8 

End 9 

Return Bnew 10 



Table 7: Algorithm for belief expansion 



The complexity of adding one new points with GER is 0{SAZB) (where S'=#states, 
74=#actions, Z=#observations, i3=#beliefs already selected). In comparison, a value backup (for 
one point) is 0{S'^AZB), and each point typically needs to be updated several times. As we point 
out in empirical results below, belief selection (even with GER) takes minimal time compared to 
value backup. 

This concludes our presentation of belief selection techniques for the PBVI framework. In 
summary, there are three factors to consider when picking a beUef point: (1) how likely is it to 

3. We tried this, however it did not perform as well empirically as what we suggest in Equation 29, because it did not 
consider the probability of reaching that belief. 
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occur? (2) how far is it from other beUef points already selected? (3) what is the current approximate 
value for that point? The simplest heuristic (RA) accounts for none of these, whereas some of the 
others (SSRA, SSGA, SSEA) account for one, and GER incorporates all three factors. 

4.6 Belief Expansion Example 

We consider a simple example, shown in Figure 4, to illustrate the difference between the various 
belief expansion techniques outlined above. This ID POMDP (Littman, 1996) has four states, one 
of which is the goal (indicated by the star). The two actions, left and right, have the expected 
(deterministic) effect. The goal state is fully observable {observation=goal), while the other three 
states are aliased {observation=none). A reward of +1 is received for being in the goal state, 
otherwise the reward is zero. We assume a discount factor of 7 = 0.75. The initial distribution is 
uniform over non-goal states, and the system resets to that distribution whenever the goal is reached. 




Figure 4: ID POMDP 

The belief set B is always initialized to contain the initial belief 60 ■ Figure 5 shows part of 
the belief tree, including the original belief set (top node), and its envelope (leaf nodes). We now 
consider what each belief expansion method might do. 



bO=[ 1/3 1/3 1/3 ] 




Figure 5: ID POMDP belief tree 

The Random heuristic can pick any belief point (with equal probability) from the entire belief 
simplex. It does not directly expand any branches of the belief tree, but it will eventually put samples 
nearby. 

The Stochastic Simulation with Random Action has a 50% chance of picking each action. 
Then, regardless of which action was picked, there's a 2/3 chance of seeing observation none, and a 
1/3 chance of seeing observation goal. As a result, the SSRA will select: Pr{hnew = bl) = 0.5 * |, 

Pr{hnew = h2) = 0.5 * \, Pr{bnew = b3) = 0.5 * |, Pr{bnew = M) = 0.5 * I 
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The Stochastic Simulation with Greedy Action first needs to know the poUcy at f>o- A. few 
iterations of point-based updates (Section 2.4) applied to this initial (single point) belief set reveal 
that 7r(6o) = left^ As a result, expansion of the belief will greedily select action left with proba- 
bility 1 — e + PI = 0.95 (assuming e = 0.1 and |yl| = 2). Action right will be selected for belief 
expansion with probability = 0.05. Combining this along with the observation probabilities, we 
can tell that SSGA will expand as follows: Pr{bnew = bl) = 0.95 * |, Prihnew = b2) = 0.95 * g, 
Pr{bnew = b3) = 0.05 * f , Pr(6„e«, = M) = 0.05 * |. 

Predicting the choice of Stochastic Simulation with Exploratory Action is slightly more com- 
plicated. Four cases can occur, depending on the outcomes of random forward simulation from bo: 

1. If action left goes to bi {Pr = 2/3) and action right goes to 63 {Pr = 2/3), then 61 will be 
selected because ||6o — fci||i = 4/^ whereas ||6o — 63II1 = 2/3. This case will occur with 
Pr = 4/9. 

2. If action left goes to 61 {Pr = 2/3) and action right goes to 64 {Pr = 1/3), then 64 will be 
selected because | |6o — &4I |i = 2. This case will occur with Pr = 2/9. 

3. If action left goes to 62 {Pr = 1/3) and action right goes to 63 {Pr = 2/3), then 62 will be 
selected because | |&o — &2I |i = 2. This case will occur with Pr = 2/9. 

4. If action left goes to 62 {Pr = 1/3) and action right goes to 64 {Pr = 1/3), then either can 
be selected (since they are equidistant to bo). In this case each 62 and 64 has Pr = 1/18 of 
being selected. 

All told, Pribnew = bl) = 4/9, Pr{bnevj = b2) = 5/18, Pr{bnew = b3) = 0, Pr{bnew = b4) = 

5/18. 

Now looking at belief expansion using Greedy Error Reduction, we need to compute the 
error e{T{bo,a,z)),\/a,z. We consider Equation 28: since B has only one point, bo, then nec- 
essarily b = bo. To estimate a, we apply multiple steps of value backup at bo and obtain 
a = [0.94 0.94 0.92 1.74]. Using 6 and a as such, we can now estimate the error at each can- 
didate belief: e{bi) = 2.93, 6(62) = 4.28, 6(63) = 1.20, 6(64) = 4.28. Note that because B has 
only one point, the dominating factor is their distance to 60. Next, we factor in the observation 
probabilities, as in Eqns 31-32, which allows us to determine that a = left and z = none, and 
therefore we should select bnew = bi. 

In summary, we note that SSGA, SSEA and GER all favor selecting bi, whereas SSRA picks 
each option with equal probabihty (considering that 62 and 64 are actually the same). In general, 
for a problem of this size, it is reasonable to expand the entire belief tree. Any of the techniques 
discussed here will be do this quickly, except RA which will not pick the exact nodes in the belief 
tree, but will select equally good nearby beliefs. This example is provided simply to illustrate the 
different choices made by each strategy. 

5. A Review of Point-Based Approaches for POMDP Solving 

The previous section describes a new class of point-based algorithms for POMDP solving. The idea 
of using point-based updates in POMDPs has been explored previously in the literature, and in this 

4. This may not be obvious to the reader, but it follows directly from the repeated application of equations 20-23. 
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section we summarize the main results. For most of the approaches discussed below, the procedure 
for updating the value function at a given point remains unchanged (as outlined in Section 2.4). 
Rather, the approaches are mainly differentiated by how the belief points are selected, and by how 
the updates are ordered. 

5.1 Exact Point-Based Algorithms 

Some of the earlier exact POMDP techniques use point-based backups to optimize the value func- 
tion over limited regions of the belief simplex (Sondik, 1971; Cheng, 1988). These techniques 
typically require solving multiple linear programs to find candidate belief points where the value 
function is sub-optimal, which can be an expensive operation. Furthermore, to guarantee that an ex- 
act solution is found, relevant beliefs must be generated systematically, meaning that all reachable 
beliefs must be considered. As a result, these methods typically cannot scale beyond a handful of 
states/actions/observations. 

In work by Zhang and Zhang (2001), point-based updates are interleaved with standard dy- 
namic programming updates to further accelerate planning. In this case the points are not generated 
systematically, but rather backups are applied to both a set of witness points and LP points. The 
witness points are identified as a result of the standard dynamic programming updates, whereas 
the LP points are identified by solving linear programs to identify beliefs where the value has not 
yet been improved. Both of these procedures are significantly more expensive than the belief se- 
lection heuristics presented in this paper and results are limited to domains with at most a dozen 
states/actions/observations. Nonetheless this approach is guaranteed to converge to the optimal so- 
lution. 

5.2 Grid-Based Approximations 

There exists many approaches that approximate the value function using a finite set of belief points 
along with their values. These points are often distributed according to a grid pattern over the belief 
space, thus the name grid-based approximation. An interpolation-extrapolation rule specifies the 
value at non-grid points as a function of the value of neighboring grid-points. These approaches 
ignore the convexity of the POMDP value function. 

Performing value backups over grid-points is relatively straightforward: dynamic programming 
updates as specified in Equation 1 1 can be adapted to grid-points for a simple polynomial-time 
algorithm. Given a set of grid points G, the value at each 6*^ G G is defined by: 



If t(6, a, z) is part of the grid, then F(r(6, a, z)) is defined by the value backups. Otherwise, 
y (r(6, a, z)) is approximated using an interpolation rule such as: 



where X{i) > and J2i=i ^(^) = 1- This produces a convex combination over grid-points. The 
two more interesting questions with respect to grid-based approximations are (1) how to calculate 
the interpolation function; and (2) how to select grid points. 




(33) 




(34) 



i=l 
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In general, to find the interpolation that leads to the best value function approximation at a point 
b requires solving the following linear program: 

\G\ 

Minimize ^A(i)y(fef) (35) 

i=l 

\G\ 

Subject to b = Y^ \{i)hf (36) 

i=l 

^A(i) = l (37) 

i=l 

0< A(z) < 1,1 < z < |G|. (38) 

Different approaches have been proposed to select grid points. Lovejoy (1991a) constructs a 
fixed-resolution regular grid over the entire belief space. A benefit is that value interpolations can be 
calculated quickly by considering only neighboring grid-points. The disadvantage is that the number 
of grid points grows exponentially with the dimensionahty of the belief (i.e., with the number of 
states). A simpler approach would be to select random points over the behef space (Hauskrecht, 
1997). But this requires slower interpolation for estimating the value of the new points. Both 
of these methods are less than ideal when the beliefs encountered are not uniformly distributed. 
In particular, many problems are characterized by dense behefs at the edges of the simplex (i.e., 
probability mass focused on a few states, and most other states have zero probability), and low 
belief density in the middle of the simplex. A distribution of grid-points that better reflects the 
actual distribution over belief points is therefore preferable. 

Alternately, Hauskrecht (1997) also proposes using the corner points of the belief simplex (e.g., 
[1 . . . ], [0 1 . . . ], . . . , [0 . . . 1]), and generating additional successor belief points through 
one-step stochastic simulations (Eqn 7) from the comer points. He also proposes an approximate 
interpolation algorithm that uses the values at |5| — 1 critical points plus one non-critical point in the 
grid. An alternative approach is that by Brafman (1997), which builds a grid by also starting with the 
critical points of the belief simplex, but then uses a heuristic to estimate the usefulness of gradually 
adding intermediate points (e.g., bk = 0.56i -I- 0.56j, for any pair of points). Both Hauskrecht's 
and Brafman's methods — generally referred to as non-regular grid approximations — require fewer 
points than Lovejoy's regular grid approach. However the interpolation rule used to calculate the 
value at non-grid points is typically more expensive to compute, since it involves searching over all 
grid points, rather than just the neighboring sub-simplex. 

Zhou and Hansen (2001) propose a grid-based approximation that combines advantages from 
both regular and non-regular grids. The idea is to sub-sample the regular fixed-resolution grid 
proposed by Lovejoy. This gives a variable resolution grid since some parts of the beliefs can be 
more densely sampled than others and by restricting grid points to lie on the fixed-resolution grid 
the approach can guarantee fast value interpolation for non-grid points. Nonetheless, the algorithm 
often requires a large number of grid points to achieve good performance. 

Finally, Bonet (2002) proposes the first grid-based algorithm for POMDPs with e-optimality 
(for any e > 0). This approach requires thorough coverage of the belief space such that every point 
is within (5 of a grid-point. The value update for each grid point is fast to implement, since the 
interpolation rule depends only on the nearest neighbor of the one-step successor belief for each 
grid point (which can be pre-computed). The main limitation is the fact that e-coverage of the belief 
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space can only be attained by using exponentially many grid points. Furthermore, this method 
requires good coverage of the entire belief space, as opposed to the algorithms of Section 4, which 
focus on coverage of the reachable behefs. 

5.3 Approximate Point-Based Algorithms 

More similar to the PBVI-class of algorithms are those approaches that update both the value and 
gradient at each grid point (Lovejoy, 1991a; Hauskrecht, 2000; Poon, 2001). These methods are able 
to preserve the piecewise linearity and convexity of the value function, and define a value function 
over the entire belief simplex. Most of these methods use random beliefs, and/or require the inclu- 
sion of a large number of fixed beliefs such as the corners of the probability simplex. In contrast, the 
PBVI-class algorithms we propose (with the exception of PB VI-f-RA) select only reachable beliefs, 
and in particular those belief points that improve the error bounds as quickly as possible. The idea 
of using reachability analysis (also known as stochastic simulation) to generate new points was ex- 
plored by some of the earlier approaches (Hauskrecht, 2000; Poon, 2001). However their analysis 
indicated that stochastic simulation was not superior to random point placements. We re- visit this 
question (and conclude otherwise) in the empirical evaluation presented below. 

More recently, a technique closely related to PBVl called Perseus has been proposed (Vlassis 
& Spaan, 2004; Spaan & Vlassis, 2005). Perseus uses point-based backups similar to the ones 
used in PBVI, but the two approaches differ in two ways. First, Perseus uses randomly generated 
trajectories through the belief space to select a set of belief points. This is in contrast to the belief- 
point selection heuristics outlined above for PBVl. Second, whereas PBVl systematically updates 
the value at all belief points at every epoch of value iteration, Perseus selects a subset of points 
to update at every epoch. The method used to select points is the following: points are randomly 
sampled one at a time and their value is updated. This continues until the value of all points has 
been improved. The insight resides in observing that updating the a-vector at one point often also 
improves the value estimate of other nearby points (which are then removed from the sampling set). 
This approach is conceptually simple and empirically effective. 

The HSVI algorithm (Smith & Simmons, 2004) is another point-based algorithm, which differs 
from PBVI both in how it picks belief points, and in how it orders value updates. It maintains a lower 
and an upper bound on the value function approximation, and uses it to select belief points. The 
updating of the upper bound requires solving linear programs and is generally the most expensive 
step. The ordering of value update is as follows: whenever a belief point is expanded from the 
belief tree, HSVI updates only the value of its direct ancestors (parents, grand-parents, etc., all the 
way back to the initial belief in the head node). This is in contrast to PBVl which performs a batch 
of belief point expansions, followed by a batch of value updates over all points. In other respects, 
HSVI and PBVI share many similarities: both offer anytime performance, theoretical guarantees, 
and scalability; finally the HSVI also takes reachability into account. We will evaluate empirical 
differences between HSVI and PBVl in the next section. 

Finally, the RTBSS algorithm (Paquet, 2005) offers an online version of point-based algorithms. 
The idea is to construct a belief reachability tree similar to Figure 2, but using the current belief 
as the top node, and terminating the tree at some fixed depth d. The value at each node can be 
computed recursively over the finite planning horizon d. The algorithm can eliminate some subtrees 
by calculating a bound on their value, and comparing it to the value of other computed subtrees. 
RTBSS can in fact be combined with an offline algorithms such as PBVl, where the offline algorithm 
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is used to pre-compute a lower bound on the exact value function; this can be used to increase subtree 
pruning, thereby increasing the depth of the online tree construction and thus also the quality of the 
solution. This online algorithm can yield fast results in very large POMDP domains. However the 
overall solution quality does not achieve the same error guarantees as the offline approaches. 

6. Experimental Evaluation 

This section looks at a variety of simulated POMDP domains to evaluate the empirical performance 
of PBVI. The first three domains — Tiger-grid, Hallway, Hallway2 — are extracted from the estab- 
lished POMDP literature (Cassandra, 1999). The fourth — Tag — was introduced in some of our 
earlier work as a new challenge for POMDP algorithms. 

The first goal of these experiments is to estabUsh the scalability of the PBVI framework; this is 
accomplished by showing that PBVI-type algorithms can successfully solve problems in excess of 
800 states. We also demonstrate that PBVI algorithms compare favorably to alternative approximate 
value iteration methods. Finally, following on the example of Section 4.6, we study at a larger scale 
the impact of the belief selection strategy, which confirms the superior performance of the GER 
strategy. 

6.1 Maze Problems 

There exists a set of benchmark problems commonly used to evaluate POMDP planning algo- 
rithms (Cassandra, 1999). This section presents results demonstrating the performance of PBVI- 
class algorithms on some of those problems. While these benchmark problems are relatively small 
(at most 92 states, 5 actions, and 17 observations) compared to most robotics planning domains, 
they are useful from an analysis point of view and for comparison to previous work. 

The initial performance analysis focuses on three well-known problems from the POMDP liter- 
ature: Tiger-grid (also known as Maze33), Hallway, and Hallway2. All three are maze navigation 
problems of various sizes. The problems are fully described by Littman, Cassandra, and KaelbUng 
(1995a); parameterization is available from Cassandra (1999). 

Figure 6a presents results for the Tiger-grid domain. Replicating earlier experiments by Braf- 
man (1997), test runs terminate after 500 steps (there's an automatic reset every time the goal is 
reached) and results are averaged over 151 runs. 

Figures 6b and 6c present results for the Hallway and Hallway2 domains, respectively. In this 
case, test runs are terminated when the goal is reached or after 251 steps (whichever occurs first), 
and the results are averaged over 251 runs. This is consistent with earlier experiments by Littman, 
Cassandra, and Kaelbling (1995b). 

All three figures compare the performance of three different algorithms: 

1. PBVI with Greedy Error Reduction (GER) belief point selection (Section 4.5). 

2. QMDP (Littman et al., 1995b), 

3. Incremental Pruning (Cassandra, Littman, & Zhang, 1997), 

The QMDP heuristic (Littman et al, 1995b) takes into account partial observability at the cur- 
rent step, but assumes full observability on subsequent steps: 

T^QMDpib) = aigmax^b{s)QMDp{s,a). (39) 
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The resulting policy has some ability to resolve uncertainty, but cannot benefit from long-term 
information gathering, or compare actions with different information potential. QMDP can be seen 
as providing a good performance baseline. For the three problems considered, it finds a policy 
extremely quickly, but the policy is clearly sub-optimal. 

At the other end of the spectrum, the Incremental Pruning algorithm (Zhang & Liu, 1996; Cas- 
sandra et al., 1997) is a direct extension of the enumeration algorithm described above. The princi- 
pal insight is that the pruning of dominated a- vectors (Eqn 19) can be interleaved directly with the 
cross-sum operator (Eqn 16). The resulting value function is the same, but the algorithm is more 
efficient because it discards unnecessary vectors earUer on. While Incremental Pruning algorithm 
can theoretically find an optimal policy, for the three problems considered here it would take far too 
long. In fact, only a few iterations of exact backups were completed in reasonable time. In all three 
problems, the resulting short-horizon policy was worse than the corresponding PBVI policy. 

As shown in Figure 6, PBVI-i-GER provides a much better time/performance trade-off. It finds 
policies that are better than those obtained with QMDP, and does so in a matter of seconds, thereby 
demonstrating that it does not suffer from the same paralyzing complexity as Incremental Pruning. 

While those who take a closer look at these results may be surprised to see that the performance 
of PBVI actually decreases at some points (e.g., the "dip" in Fig. 6c), this is not unexpected. It 
is important to remember that the theoretical properties of PBVI only guarantee a bound on the 
estimate of the value function, but as shown here, this does not necessarily imply that the policy 
needs to improve monotonically. Nonetheless, as the value function converges, so will the policy 
(albeit at a slower rate). 

6.2 Tag Problem 

While the previous section establishes the good performance of PBVI on some well-known simu- 
lation problems, these are quite small and do not fully demonstrate the scalability of the algorithm. 
To provide a better understanding of PBVI's effectiveness for large problems, this section presents 
results obtained when applying PBVI to the Tag problem, a robot version of the popular game of 
lasertag. In this problem, the agent must navigate its environment with the goal of searching for, 
and tagging, a moving target (Rosencrantz, Gordon, & Thrun, 2003). Real-world versions of this 
problem can take many forms, and in Section 7 we present a similar problem domain where an 
interactive service robot must find an elderly patient roaming the corridors of a nursing home. 

The synthetic scenario considered here is an order of magnitude larger (870 states) than most 
other POMDP benchmarks in the hterature (Cassandra, 1999). When formulated as a POMDP prob- 
lem, the goal is for the robot to optimize a policy allowing it to quickly find the person, assuming 
that the person moves (stochastically) according to a fixed policy. The spatial configuration of the 
environment used throughout this experiment is illustrated in Figure 7. 

The state space is described by the cross-product of two position features, Robot = 
{so, . . . , S29} and Person = {so, • • ■ , s found}- Both start in independently-selected random 
positions, and the scenario finishes when Person = Sfmmd- The robot can select from five actions: 
{North, South, East, West, Tag}. A reward of —1 is imposed for each motion action; the Tag action 
results in a +10 reward if the robot and person are in the same cell, or —10 otherwise. Through- 
out the scenario, the Robot's position is fully observable, and a Move action has the predictable 
deterministic effect, e. g.: 

Pr {Robot = sio I Robot = sq, North) = 1, 
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(c) Hallway2 



Figure 6: PBVI performance on well-known POMDP problems. Each figure shows the sum of 
discounted reward as a function of the computation time for a different problem domain. 
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Figure 7: Spatial configuration of the domain 
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and so on for each adjacent cell and direction. The position of the person, on the other hand, 
is completely unobservable unless both agents are in the same cell. Meanwhile at each step, the 
person (with omniscient knowledge) moves away from the robot with Pr = 0.8 and stays in place 
with Pr = 0.2, e. g.: 

Pr{Person = sig | Person = si^SzRobot = sq) = 0.4 
Pr{Person = S20 \ Person = si^SzRohot = sq) = 0.4 
Pr{Person = S15 | Person = si^hRobot = sq) = 0.2. 

Figure 8 shows the performance of PB VI with Greedy Error Reduction on the Tag domain. Re- 
sults are averaged over 1000 runs, using different (randomly chosen) start positions for each run. 
The QMDP approximation is also tested to provide a baseline comparison. The results show a grad- 
ual improvement in PBVI's performance as samples are added (each shown data point represents a 
new expansion of the belief set with value backups). It also confirms that computation time is di- 
rectly related to the number of belief points. PBVI requires fewer than 100 belief points to overcome 
QMDP, and the performance keeps on improving as more points are added. Performance appears to 
be converging with approximately 250 belief points. These results show that a PBVI-cIass algorithm 
can effectively tackle a problem with 870 states. 




TIME (sees) 



Figure 8: PBVI performance on Tag problem. We show the sum of discounted reward as a function 
of the computation time. 

This problem is far beyond the reach of the Incremental Pruning algorithm. A single iteration of 
optimal value iteration on a problem of this size could produce over lO^'^ a-vectors before pruning. 
Therefore, it was not applied. 

This section describes one version of the Tag problem, which was used for simulation purposes 
in our work and that of others (Braziunas & Boutilier, 2004; Poupart & Boutilier, 2004; Smith & 
Simmons, 2004; Vlassis & Spaan, 2004). In fact, the problem can be re-formulated in a variety 
of ways to accommodate different environments, person motion models, and observation models. 
Section 7 discusses variations on this problem using more realistic robot and person models, and 
presents results validated onboard an independently developed robot simulator. 
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6.3 Empirical Comparison of PBVI-Class Algorithms 

Having establish the good performance of PB VI+GER on a number of problems, we now consider 

empirical results for the different PBVI-class algorithms. This allows us to compare the effects 
of the various belief expansion heuristics. We repeat the experiments on the Tiger-grid, Hallway, 
Hallway2 and Tag domains, as outlined above, but in this case we compare the performance of five 
different PBVI-class algorithms: 

1. PBVI+RA: PBVI with belief points selected randomly from belief simplex (Section 4.1). 

2. PBVI+SSRA: PBVI with belief points selected using stochastic simulation with random ac- 
tion (Section 4.2). 

3. PB VI+SSGA: PBVI with belief points selected using stochastic simulation with greedy action 
(Section 4.3). 

4. PBVI+SSEA: PBVI with belief points selected using stochastic simulation with exploratory 
action (Section 4.4). 

5. PB VI+GER: PBVI with belief points selected using greedy error reduction (Section 4.5). 

All PBVI-class algorithms can converge to the optimal value function given a sufficiently large 
set of behef points. But the rate at which they converge depends on their ability to generally pick 
useful points, and leave out the points containing less information. Since the computation time 
is directly proportional to the number of belief points, the algorithm with the best performance is 
generally the one which can find a good solution with the fewest belief points. 

Figure 9 shows a comparison between the performance of each of the five PBVI-class algorithms 
enumerated above on each of the four problem domains. In these pictures, we present performance 
results as a function of computation time.^ 

As seen from these results, in the smallest domain — Tiger-grid — PBVI-i-GER is similar in per- 
formance to the random approach PBVI-i-RA. In the Hallway domain, PBVI-hGER reaches near- 
optimal performance earlier than the other algorithms. In Hallway2, it is unclear which of the five 
algorithms is best, though GER seems to converge earlier. 

In the larger Tag domain, the situation is more interesting. The PBVI-i-GER combination is 
clearly superior to the others. There is reason to beheve that PBVI-i-SSEA could match its perfor- 
mance, but would require on the order of twice as many points to do so. Nonetheless, PBVI-i-SSEA 
performs better than either PBVI-i-SSRA or PBVI+SSGA. With the random heuristic (PBVI-i-RA), 
the reward did not improve regardless of how many belief points were added (4000-I-), and there- 
fore we do not include it in the results. The results presented in Figure 9 suggest that the choice 
of belief points is crucial when dealing with large problems. In general, we believe that GER (and 
SSEA to a lesser degree) is superior to the other heuristics for solving domains with large numbers 
of action/observation pairs, because it has the ability to selectively chooses which branches of the 
reachability tree to explore. 

As a side note, we were surprised by SSGA's poor performance (in comparison with SSRA) on 
the Tiger-grid and Tag domains. This could be due to a poorly tuned greedy bias e, which we did 

5. Nearly identical graphs can be produced showing performance results as a function of the number of belief points. 
This confirms complexity analysis showing that the computation time is directly related to the number of belief 
points. 
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TIME (sees) TIME (sees) 



(c) Hallway2 (d) Tag 

Figure 9: Belief expansion results showing execution performance as a function of the computation 
time. 



not investigate at length. Future investigations using problems with a larger number of actions may 
shed better light on this issue. 

In terms of computational requirement, GER is the most expensive to compute, followed by 
SSEA. However in all cases, the time to perform the belief expansion step is generally negligible 
(< 1%) compared to the cost of the value update steps. Therefore it seems best to use the more 
effective (though more expensive) heuristic. 

The PBVI framework can accommodate a wide variety of strategies, past what is described in 
this paper. For example, one could extract belief points directly from sampled experimental traces. 
This will be the subject of future investigations. 

6.4 Comparative Analysis 

While the results outlined above show that PBVI-type algorithms are able to handle a wide spectrum 
of large-scale POMDP domains, it is not sufficient to compare the performance of PBVI only to 
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QMDP and Incremental Pruning — the two ends of the spectrum — as done in Section 6.1. In fact 
there has been significant activity in recent years in the development of fast approximate POMDP 
algorithms, and so it is worthwhile to spend some time comparing the PBVI framework to these 
alternative approaches. This is made easy by the fact that many of these have been validated using 
the same set of problems as described above. 

Table 8 summarizes the performance of a large number of recent POMDP approximation algo- 
rithms, including PBVI, on the four target domains: Tiger-grid, Hallway, Hallway!, and Tag. The 
algorithms listed were selected based on the availability of comparable pubUshed results or available 
code, or in some cases because the algorithm could be re-implemented easily. 

We compare their empirical performance, in terms of execution performance versus planning, 
on a set of simulation domains. However as is often the case, these results show that there is not a 
single algorithm that is best for solving all problems. We therefore also compile a summary of the 
attributes and characteristics of each algorithm, in an attempt to tell which algorithm may be best 
for what types of problems. Table 8 includes (whenever possible) the goal completion rates, sum 
of rewards, pohcy computation time, number of required belief points, and policy size (number of 
a-vectors, or number of nodes in finite state controllers). The number of belief points and policy 
size are often identical, however the latter can be smaller if a single a-vector is best for multiple 
belief points. 

The results marked [*] were computed by us on a 3GHz Pentium 4; other results were likely 
computed on different platforms, and therefore time comparisons may be approximate at best. 
Nonetheless the number of samples and the size of the final poUcy are both useful indicators of 
computation time. The results reported for PBVI correspond to the earliest data point from Fig- 
ures 6 and 8 where PBVI-hGER achieves top performance. 

Algorithms are listed in order of performance, starting with the algorithm(s) achieving the high- 
est reward. All results assume a standard (not lookahead) controller (see Hauskrecht, 2000, for 
definition). 

Overall, the results indicate that some of the algorithms achieve sub-par performance in terms of 
expected reward. In the case of QMDP, this is because of fundamental limitations in the algorithm. 
While Incremental Pruning and the exact value-directed compression can theoretically reach optimal 
performance, they would require longer computation time to do so. The grid method (see Tiger-grid 
results), BPI (see Tiger-grid, Hallway and Tag results) and PBUA (see Tag results) suffer from a 
similar problem, but offer much more graceful performance degradation. It is worth noting that none 
of these approaches assumes a known initial belief, so in effect they are solving harder problems. 
The results for BBSLS are not sufficiently extensive to comment at length, but it appears to be able 
to find reasonable policies with small controllers (see Tag results). 

The remaining algorithms — HSVI, Perseus, and our own PBVI-i-GER — all offer comparable 
performance on these relatively large POMDP domains. HSVI seems to offer good control perfor- 
mance on the full range of tasks, but requires bigger controllers, and is therefore probably slower, 
especially on domains with high stochasticity (e.g., Tiger-grid, Hallway, Hallway2). The trade-offs 
between Perseus and PBVI-i-GER are less clear: the planning time, controller size and performance 
quality are quite comparable, and in fact the two approaches are very similar. Similarities and 
differences between the two approaches are explored further in Section 7. 



366 



Anytime Point-Based Approximations for Large POMDPs 



Method 


GoaI% 


Reward it Conf.Int. 


Time(s) 


\B\ 


|7r| 


Tiger-Grid (Maze33) 












HSVI (Smith & Sinunons, 2004) 


n.a. 


2.35 


10341 


n.v. 


4860 


Perseus (Vlassis & Spaan, 2004) 


n.a. 


2.34 


104 


10000 


134 


PBUA (Poon, 2001) 


n.a. 


2.30 


12116 


660 


n.v. 


PBVI+GER[*] 


n.a. 


2.27 ±0.13 


397 


512 


508 


BPI (Poupart & Boutilier, 2004) 


n.a. 


1.81 


163420 


n.a. 


1500 


Grid (Brafman, 1997) 


n.a. 


0.94 


n.v. 


\1A 


n.a. 


QMDP (Littman et al., 1995b)[*] 


n.a. 


0.276 


0.02 


n.a. 


5 


IncPrune (Cassandra et al., 1997)[*] 


n.a. 


0.0 


24hrs-i- 


n.a. 


n.v. 


Exact VDC (Poupart & Boutilier, 2003)[*] 


n.a. 


0.0 


24hrs-H 


n.a. 


n.v. 


Hallway 












PBUA (Poon, 2001) 


100 


0.53 


450 


300 


n.v. 


HSVI (Smith & Simmons, 2004) 


100 


0.52 


10836 


n.v. 


1341 


PBVI+GER[*] 


100 


0.51 ± 0.03 


19 


64 


64 


Perseus (Vlassls & Spaan, 2004) 


n.v. 


0.51 


35 


10000 


55 


BPI (Poupart & BoutiUer, 2004) 


n.v. 


0.51 


249730 


n.a. 


1500 


QMDP (Littman et al., 1995h)F*l 


51 


0.265 


0.03 


n.a. 


5 


Exact VDC (Poupart & BoutiUer, 2003)[*] 


39 


0.161 


24hrs-i- 


n.a. 


n.v. 


IncPrune (Cassandra et al., 1997)[*] 


39 


0.161 


24hrs-i- 


n.a. 


n.v. 


Hallway2 












PBVI+GER[*] 


100 


0.37 ± 0.04 


6 


32 


31 


Perseus (Vlassls & Spaan, 2004) 


n.v. 


0.35 


10 


10000 


56 


HSVI (Smith & Simmons, 2004) 


100 


0.35 


10010 


n.v. 


1571 


PBUA (Poon, 2001) 


100 


0.35 


27898 


1840 


n.v. 


BPI (Poupart & Boutilier, 2004) 


n.v. 


0.28 


274280 


n.a. 


1500 


Grid (Brafman, 1997) 


98 


n.v. 


n.v. 


337 


n.a. 


QMDP (Littman et al., 1995b)[*] 


22 


0.109 


1.44 


n.a. 


5 


Exact VDC (Poupart & BoutiUer, 2003)[*] 


48 


0.137 


24hrs-i- 


n.a. 


n.v. 


IncPrune (Cassandra et al., 1997)[*] 


48 


0.137 


24hrs-i- 


n.a. 


n.v. 


Tag 












HSVI (Smith & Simmons, 2004) 


1 f\f\ 

100 


-6.37 


10113 


n.v. 


1 /' CI 

1657 


PBVI+GER[*] 


100 


-6.75 ± 0.39 


8946 


256 


203 


Perseus (Vlassis & Spaan, 2004) 


n.v. 


-6.85 


3076 


10000 


205 


BBSLS (Braziunas & Boutilier, 2004) 


n.v. 


-8.31 


100054 


n.a. 


30 


BPI (Poupart & Boutilier, 2004) 


n.v. 


-9.18 


59772 


n.a. 


940 


QMDP (Littman et al., 1995b)[*] 


19 


-16.62 


1.33 


n.a. 


5 


PBUA (Poon, 2001)[*] 





-19.9 


24hrs-i- 


4096 


n.v. 


IncPrune (Cassandra et al., 1997)[*] 





-19.9 


24hrs-i- 


n.a. 


n.v. 



n.a.=not applicable n.v.=not available [*]=results computed by us 



Table 8: Results of PBVI for standard POMDP domains 
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6.5 Error Estimates 

The results presented thus far suggest that the PBVI framework performs best when using the 

Greedy Error Reduction (GER) technique for selecting belief points. Under this scheme, to de- 
cide which belief points will be included, we estimate an error bound at a set of candidate points 
and then pick the one with the largest error estimate. The error bound is estimated as described in 
Equation 28. We now consider the question of how this estimate evolves as more and more points 
are added. The natural intuition is that with the first few points, error estimates will be very large, 
but as the density of the belief set increases, the error estimates will become much smaller. 

Figure 10 reconsiders the four target domains: Tiger-grid, Hallway, Hallway2 and Tag. In each 
case, we present both the reward performance as a function of the number of belief points (top 
row graphs), and the error estimate of each point selected according the order in which points were 
picked (bottom row graphs). In addition, the bottom graphs also show (in dashed line) a trivial 
bound on the error ||Vt — < '^'""f Z^"*'" , valid for any t-step value function of an arbitrary 
policy. As expected, our bound is typically tighter than the trivial bound. In Tag, this only occurs 
once the number of belief points exceeds the number of states, which is not surprising, given that our 
bound depends on distance between reachable beliefs, and that all states are reachable beliefs in this 
domain. Overall, it seems that there is reasonably good correspondence between an improvement 
in performance and a decrease in our error estimates. We can conclude from this figure that even 
though the PBVI error is quite loose, it can in fact be informative in guiding exploration of the belief 
simplex. 

We note that there is significant variance in our error estimates from one belief point to the next, 
as illustrated by the non-monotonic behavior of the curves in the bottom graphs of Figure 10. This 
behavior can be attributed to a few possibilities. First, there is the fact that the error estimate at 
a given belief is only approximate. And the value function used to calculate the error estimate is 
itself approximate. In addition, there is the fact that new belief points are always selected from the 
envelope of reachable beliefs, not from the set of all reachable beliefs. This suggests that GER could 
be improved by maintaining a deeper envelope of candidate belief points. Currently the envelope 
contains those points that are 1-step forward simulations from the points already selected. It may 
be useful to consider points 2-3 steps ahead. We predict this would reduce the jaggedness seen in 
Figure 10, and more importantly, also reduce the number of points necessary for good performance. 
Of course, the tradeoff between the time spent selecting points and the time spent planning would 
have to be re-evaluated under this hght. 

7. Robotic Applications 

The overall motivation behind the work described in this paper is the desire to provide high-quality 
robust planning for real-world autonomous systems, and in particular for robots. On the practi- 
cal side, our search for a robust robot controller has been in large part guided by the Nursebot 
project (Pineau, Montermerlo, Pollack, Roy, & Thrun, 2003). The overall goal of the project is to 
develop personalized robotic technology that can play an active role in providing improved care 
and services to non-institutionalized elderly people. Pearl, shown in Figure 11, is the main robotic 
platform used for this project. 

From the many services a nursing-assistant robot could provide (Engelberger, 1999; Lacey 
& Dawson-Howe, 1998), much of the work to date has focused on providing timely cognitive re- 
minders (e.g., medications to take, appointments to attend, etc.) to elderly subjects (Pollack, 2002). 
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10° 10' lo'^ 10' 10° 10' lo"^ 10' 10° 10' 10' 10^ 10° 10' 10^ 10° 



# belief points # beiief points # beiief points # beiief points 

Figure 10: Sum of discounted reward (top graphs) and estimate of the bound on the error (bottom 
graphs) as a function of the number of selected belief points. 




Figure 1 1 : Pearl the Nursebot, interacting with elderly people at a nursing facility 

An important component of this task is finding the patient whenever it is time to issue a reminder. 
This task shares many similarities with the Tag problem presented in Section 6.2. In this case, 
however, a robot-generated map of a real physical environment is used as the basis for the spatial 
configuration of the domain. This map is shown in Figure 12. The white areas correspond to free 
space, the black lines indicate walls (or other obstacles) and the dark gray areas are not visible or 
accessible to the robot. One can easily imagine the patient's room and physiotherapy unit lying at 
either end of the corridor, with a common area shown in the upper-middle section. 

The overall goal is for the robot to traverse the domain in order to find the missing patient and 
then deliver a message. The robot must systematically explore the environment, reasoning about 
both spatial coverage and human motion patterns, in order to find the person. 
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7.1 POMDP Modeling 

The problem domain is represented jointly by two state features: RobotPosition, PersonPosition. 
Each feature is expressed through a discretization of the environment. Most of the experiments 
below assume a discretization of 2 meters, which means 26 discrete cells for each feature, for a total 
of 676 states. 

It is assumed that the person and robot can move freely throughout this space. The robot's 
motion is deterministically controlled by the choice of action {North, South, East, West). The robot 
has a fifth action (DeliverMessage), which concludes the scenario when used appropriately (i.e., 
when the robot and person are in the same location). 

The person's motion is stochastic and falls in one of two modes. Part of the time, the person 
moves according to Brownian motion (e.g., moves in each cardinal direction with Pr = 0.1, other- 
wise stays put). At other times, the person moves directly away from the robot. The Tag domain of 
Section 6.2 assumes that the person always moves always moves away the robot. This is not real- 
istic when the person cannot see the robot. The current experiment instead assumes that the person 
moves according to Brownian motion when the robot is far away, and moves away from the robot 
when it is closer (e.g., < 4m). The person policy was designed this way to encourage the robot to 
find a robust policy. 

In terms of state observability, there are two components: what the robot can sense about its 
own position, and what it can sense about the person's position. In the first case, the assumption 
is that the robot knows its own position at all times. While this may seem like a generous (or 
optimistic) assumption, substantial experience with domains of this size and maps of this quality 
have demonstrated robust localization abilities (Thrun et al., 2000). This is especially true when 
planning operates at relatively coarse resolution (2 meters) compared to the localization precision 
(10 cm). While exact position information is assumed for planning in this domain, the execution 
phase (during which we actually measure performance) does update the belief using full localization 
information, which includes positional uncertainty whenever appropriate. 

Regarding the detection of the person, the assumption is that the robot has no knowledge of 
the person's position unless s/he is within a range of 2 meters. This is plausible given the robot's 
sensors. However, even in short-range, there is a small probability {Pr = 0.01) that the robot will 
miss the person and therefore return a false negative. 

In general, one could make sensible assumptions about the person's likely position (e.g., based 
on a knowledge of their daily activities), however we currently have no such information and there- 
fore assume a uniform distribution over all initial positions. The person's subsequent movements 
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are expressed through the motion model described above (i.e., a mix of Brownian motion and pur- 
poseful avoidance). 

The reward function is straightforward: R = —1 for any motion action, R = 10 when the 
robot decides to DeliverMessage and it is in the same cell as the person, and R = — 100 when 
the robot decides to DeliverMessage in the person's absence. The task terminates when the robot 
successfully delivers the message (i.e., a = DeliverMessage and Srobot = Sperson)- We assume a 
discount factor of 0.95. 

We assume a known initial belief, 6o, consisting of a uniform distribution over all states. This 
is used both for selecting beUef points during planning, and subsequently for executing and testing 
the final policy. 

The initial map (Fig. 12) of the domain was collected by a mobile robot, and slightly cleaned 
up by hand to remove artifacts (e.g., people walking by). We then assumed the model parameters 
described here, and appUed PBVI planning to the problem as such. Value updates and belief point 
expansions were applied in alternation until (in simulation) the policy was able to find the person 
on 99% of trials (trials were terminated when the person is found or after 100 execution steps). The 
final policy was implemented and tested onboard the publicly available CARMEN robot simula- 
tor (Montemerlo, Roy, & Thrun, 2003). 

7,2 Comparative Evaluation of PBVI and Perseus 

The subtask described here, with its 626 states, is beyond the capabilities of exact POMDP solvers. 
Furthermore, as will be demonstrated below, MDP-type approximations are not equipped to handle 
uncertainty of the type exhibited in this task. The main purpose of our analysis is to evaluate the 
effectiveness of the point-based approach described in this paper to address this problem. While the 
results on the Tag domain (Section 6.2) hint at the fact that PBVI and other algorithms may be able 
to handle this task, the more realistic map and modified motion model provide new challenges. 

We begin our investigation by directly comparing the performance of PBVI (with GER be- 
lief points selection) with that of the Perseus algorithm on this complex robot domain. Perseus 
was described in Section 5; results presented below were produced using code provided by its au- 
thors (Perseus, 2004). Results for both algorithms assume that a fixed POMDP model was generated 
by the robot simulator. This model is then stored and solved offline by each algorithm. 

PBVI and Perseus each have a few parameters to set. PBVI requires: number of new belief 
points to add at each expansion (Badd) and planning horizon for each expansion (H). Perseus 
requires: number of belief points to generate during random walk (B) and the maximum planning 
time (T). Results presented below assume the following parameter settings: B^dd = 30, i7 = 25, 
5=10,000, r=1500. Both algorithms were fairly robust to changes in these parameters.^ 

Figure 13 summarizes the results of this experiment. These suggest a number of observations. 

• As shown in Figure 13(a), both algorithms find the best solution in a similar time, but 
PBVI-i-GER has better anytime performance than Perseus (e.g., a much better poUcy is found 
given only 100 sec). 

• As shown in Figure 13(b), both algorithms require a similar number of a-vectors. 

• As shown in Figure 13(c), PBVI-i-GER requires many fewer behefs. 

6. A ±25% change in parameter value yielded sensibly similar results in terms of reward and number of a vectors, 
though of course the time, memory, and number of beliefs varied. 
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TIME (sees) TIME (sees) 



(c) (d) 



Figure 13: Comparison of PBVI and Perseus on robot simulation domain 



• Because it requires fewer beliefs, PBVI+GER has much lower memory requirements; this is 
quantified in Figure 13(d). 

These new results suggest that PBVI and Perseus have similar performance if the objective is to 
find a near-optimal solution, and time and memory are not constrained. In cases where one is willing 
to trade off accuracy for time, then PBVI may provide superior anytime performance. And in cases 
where memory is hmited, PBVI's conservative approach with respect to belief point selection is 
advantageous. Both these properties suggest that PBVI may scale better to very large domains. 



7.3 Experimental Results with Robot Simulator 

The results presented in above assume that the same POMDP model is used for planning and test- 
ing (i.e., to compute the reward in Figure 13(a)). This is useful to carry out a large number of 
experiments. The model however cannot entirely capture the dynamics of a realistic robot system, 
therefore there is some concern that the poUcy learned by point-based methods will not perform as 
well on a realistic robot. To verify the robustness of our approach, the final PBVI control policy 
was implemented and tested onboard the pubhcly available CARMEN robot simulator (Montemerlo 
et al, 2003). 
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The resulting policy is illustrated in Figure 14. This figure shows five snapshots obtained from 
a single run. In this particular scenario, the person starts at the far end of the left corridor. The 
person's location is not shown in any of the figures since it is not observable by the robot. The 
figure instead shows the belief over person positions, represented by a distribution of point samples 
(grey dots in Fig. 14). Each point represents a plausible hypothesis about the person's position. The 
figure shows the robot starting at the far right end of the corridor (Fig. 14a). The robot moves toward 
the left until the room's entrance (Fig. 14b). It then proceeds to check the entire room (Fig. 14c). 
Once relatively certain that the person is nowhere to be found, it exits the room (Fig. 14d), and 
moves down the left branch of the corridor, where it finally finds the person at the very end of the 
corridor (Fig. 14e). 

This policy is optimized for any start position (for both the person and the robot). The scenario 
shown in Figure 14 is one of the longer execution traces since the robot ends up searching the entire 
environment before finding the person. It is interesting to compare the choice of action between 
snapshots (b) and (d). The robot position in both is practically identical. Yet in (b) the robot chooses 
to go up into the room, whereas in (d) the robot chooses to move toward the left. This is a direct 
result of planning over beliefs, rather than over states. The beUef distribution over person positions 
is clearly different between those two cases, and therefore the policy specifies a very different course 
of action. 

Figure 15 looks at the policy obtained when solving this same problem using the QMDP heuris- 
tic. Four snapshots are offered from different stages of a specific scenario, assuming the person 
started on the far left side and the robot on the far right side (Fig. 15a). After proceeding to the 
room entrance (Fig. 15b), the robot continues down the corridor until it almost reaches the end 
(Fig. 15c). It then turns around and comes back toward the room entrance, where it stations itself 
(Fig. 15d) until the scenario is forcibly terminated. As a result, the robot cannot find the person 
when s/he is at the left edge of the corridor or in the room. What's more, because of the running- 
away behavior adopted by the subject, even when the person starts elsewhere in the corridor, as the 
robot approaches the person will gradually retreat to the left and similarly escape from the robot. 

Even though QMDP does not explicitly plan over beliefs, it can generate different policy actions 
for cases where the state is identical but the belief is different. This is seen when comparing Fig- 
ure 15 (b) and (d). In both of these, the robot is identically located, however the belief over person 
positions is different. In (b), most of the probability mass is to the left of the robot, therefore it trav- 
els in that direction. In (d), the probabihty mass is distributed evenly between the three branches 
(left corridor, room, right corridor). The robot is equally pulled in all directions and therefore stops 
there. This scenario illustrates some of the strength of QMDP. Namely, there are many cases where 
it is not necessary to explicitly reduce uncertainty. However, it also shows that more sophisticated 
approaches are needed to handle some cases. 

These results show that PBVI can perform outside the bounds of simple maze domains, and 
is able to handle realistic problem domains. In particular, throughout this evaluation, the robot 
simulator was in no way constrained to behave as described in our POMDP model (Sec. 7.1). This 
means that the robot's actions often had stochastic effects, the robot's position was not always 
fully observable, and that befief tracking had to be performed asynchronously (i.e., not always a 
straightforward ordering of actions and observations). Despite this misalignment between the model 
assumed for planning, and the execution environment, the control pohcy optimized by PBVI could 
successfully be used to complete the task. 
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Figure 15: Example of a QMDP policy failing to find the person 
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8. Discussion 

This paper describes a class of anytime point-based POMDP algorithms called PBVI, which com- 
bines point-based value updates with strategic selection of belief points, to solve large POMDPs. 
Further extensions to the PBVI framework, whereby value updates are applied to groups of belief 
points according to their spatial distribution, are described by (Pineau, Gordon, & Thrun, 2004). 
The main contributions pertaining to the PBVI framework are now summarized. 

Scalability. The PBVI framework is an important step towards truly scalable POMDP solutions. 
This is achieved by bounding the policy size through the selection of a small set of belief points. 

Anytime planning. PBVl-class algorithms alternates between steps of value updating and steps 
of belief point selection. As new points are added, the solution improves, at the expense of increased 
computational time. The trade-off can be controlled by adjusting the number of points. The algo- 
rithm can be terminated either when a satisfactory solution is found, or when planning time is 
elapsed. 

Bounded error. We provide a theoretical bound on the error of the approximation introduced 
in the PBVI framework. This result holds for a range of beUef point selection methods, and lead 
directly to the development of a new PBVl-type algorithm: PBVl+GER, where estimates of the 
error bound are used directly to select belief points. Furthermore we find that the bounds can be 
useful in assessing when to stop adding belief points. 

Exploration. We proposed a set of new point selection heuristics, which explore the tree of 
reachable beliefs to select useful belief points. The most successful technique described, Greedy 
Error Reduction (GER), uses an estimate of the error bound on candidate belief points to select the 
most useful points. 

Improved empirical performance. PBVI has demonstrated the ability to reduce planning time 
for a number of well-known POMDP problems, including Tiger-grid, Hallway, and Hallway2. By 
operating on a set of discrete points, PBVI algorithms can perform polynomial-time value updates, 
thereby overcoming the curse of history that paralyzes exact algorithms. The GER technique used to 
select points allows us to solve large problems with fewer belief points than alternative approaches. 

New problem domain. PBVI was applied to a new POMDP planning domain (Tag), for which 
it generated an approximate solution that outperformed baseline algorithms QMDP and Incremental 
Pruning. This new domain has since been adopted as a test case for other algorithms (Vlassis & 
Spaan, 2004; Smith & Simmons, 2004; Braziunas & Boutiher, 2004; Poupart & Boutiher, 2004). 
This fosters an increased ease of comparison between new techniques. Further comparative analysis 
was provided in Section 7.2 highhghting similarities and differences between PBVI and Perseus. 

Demonstrated performance. PBVI was applied in the context of a robotic search-and-rescue 
type scenario, where a mobile robot is required to search its environment and find a non-stationary 
individual. PBVI's performance was evaluated using a realistic, independently-developed, robot 
simulator. 

A significant portion of this paper is dedicated to a thorough comparative analysis of point-based 
methods. This includes evaluating a range of point-based selection methods, as well as evaluating 
mechanisms for ordering value updates. The comparison of point-based selection techniques sug- 
gest that the GER method presented in Section 4.5 is superior to more naive techniques. In terms of 
ordering of value updates, the randomized strategy which is used in the Perseus algorithm appears 
effective to accelerate planning. A natural next step would be to combine the GER belief selection 
heuristic with Perseus's random value updates. We performed experiments along these lines, but 
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did not achieve any significant speed-up over the current performance of PBVI or Perseus (e.g., as 
reported in Figure 13(a)). It is likely that when belief points are chosen carefully (as in GER), each 
of these points needs to be updated systematically and therefore there is no additional benefit to 
using randomized value updates. 

Looking towards the future, it is important to remember that while we have demonstrated the 
ability to solve problems which are large by POMDP standards, many real-world domains far exceed 
the complex of domains considered in this paper. In particular, it is not unusual for a problem to be 
expressed through a number of multi- valued state features, in which case the number of states grows 
exponentially with the number of features. This is of concern because each belief point and each 
a-vector has dimensionality IS*! (where \S\ is the number of states) and all dimensions are updated 
simultaneously. This is an important issue to address to improve the scalabihty of point-based value 
approaches in general. 

There are various existing attempts at overcoming the curse of dimensionality in POMDPs. 
Some of these — e. g. the belief compression techniques by Roy and Gordon (2003) — cannot be 
incorporated within the PBVI framework without compromising its theoretical properties (as dis- 
cussed in Section 3). Others, in particular the exact compression algorithm by Poupart and Boutilier 
(2003), can be combined with PBVI. However, preliminary experiments in this direction have 
yielded Uttle performance improvement. There is reason to believe that approximate value com- 
pression would yield better results, but again at the expense of forgoing PBVI's theoretical proper- 
ties. The challenge therefore is to devise function-approximation techniques that both reduce the 
dimensionality effectively, while maintaining the convexity properties of the solution. 

A secondary (but no less important) issue concerning the scalability of PBVI pertains to the 
number of belief points necessary to obtain a good solution. While problems addressed thus far can 
usually be solved with OdS"]) belief points, this need not be true. In the worse case, the number 
of beUef points necessary may be exponential in the plan length. The PBVI framework can accom- 
modate a wide variety of strategies for generating belief points, and the Greedy Error Reduction 
technique seems particularly effective. However this is unlikely to be the definitive answer to belief 
point selection. In more general terms, this relates closely to the well-known issue of exploration 
versus exploitation, which arises across a wide array of problem-solving techniques. 

These promising opportunities for future research aside, the PBVI framework has already 
pushed the envelope of POMDP problems that can be solved with existing computational resources. 
As the field of POMDPs matures, finding ways of computing policies efficiently will likely continue 
to be a major bottleneck. We hope that point based algorithms such as the PBVI will play a leading 
role in the search for more efficient algorithms. 
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